45 research outputs found

    GPU acceleration of Levenshtein distance computation between long strings

    Other grants: UAB transformative agreements. Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the number of elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings of hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than a million characters, the performance is the best ever reported, which is above TCUPS for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity.

    GPU acceleration of Levenshtein distance computation between long strings

    Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the number of elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings of hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than a million characters, the performance is the best ever reported, which is above TCUPS for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity.
    This research was supported by the European Union Regional Development Fund (ERDF) within the framework of the ERDF Operational Program of Catalonia 2014–2020 with a grant of 50% of the total cost eligible under the Designing RISC-V based Accelerators for next generation computers project (DRAC) [001-P-001723], in part by the Catalan Government under grant 2017-SGR-1624, and in part by the Spanish Ministry of Science, Innovation and Universities under grant RTI2018-095209-B-C22.
    Peer reviewed. Postprint (published version).
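    The wavefront idea summarized in this abstract can be illustrated with a minimal CPU sketch. It assumes unit costs (plain Levenshtein distance) and tracks, for each score and each diagonal, the furthest position reached. The function name `wfa_edit_distance` and its internal layout are illustrative assumptions; this is not the authors' GPU implementation and it does not include the halving optimization the abstract mentions.

```python
def wfa_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via furthest-reaching wavefronts (unit costs).

    Illustrative sketch only: names and structure are assumptions,
    not the GPU implementation described in the abstract.
    """
    n, m = len(a), len(b)
    target_diag = m - n  # diagonal k = j - i that contains the end cell (n, m)

    def extend(i: int, j: int) -> int:
        # Slide along the diagonal for free while characters match.
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    # wave[k] = furthest i (characters of `a` consumed) on diagonal k
    # reachable with at most `score` edits; the matching j is i + k.
    wave = {0: extend(0, 0)}
    score = 0
    while wave.get(target_diag, -1) < n:
        score += 1
        nxt = {}
        for k in range(-score, score + 1):
            best = -1
            if k in wave:          # substitution: stay on diagonal k
                best = max(best, wave[k] + 1)
            if k + 1 in wave:      # deletion: consume a[i], come from diagonal k + 1
                best = max(best, wave[k + 1] + 1)
            if k - 1 in wave:      # insertion: consume b[j], come from diagonal k - 1
                best = max(best, wave[k - 1])
            if best < 0:
                continue
            i = min(best, n, m - k)  # clamp so that i <= n and j = i + k <= m
            if i < 0 or i + k < 0:
                continue             # diagonal lies outside the table
            nxt[k] = extend(i, i + k)
        wave = nxt
    return score
```

    Each wavefront holds at most 2s + 1 diagonals at score s, so the work outside of match extension grows with the square of the edit distance rather than with the product of the string lengths, which is consistent with the similarity-dependent performance the abstract reports.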

    Simple real-time QRS detector with the MaMeMi filter

    Detection of QRS complexes in ECG signals is required to determine heart rate, and it is an important step in the study of cardiac disorders. ECG signals are usually affected by low- and high-frequency noise. To improve the accuracy of QRS detectors, several methods have been proposed to filter out the noise and detect the characteristic pattern of the QRS complex. Most existing methods suffer from relatively high computational complexity or high resource needs, making them poorly suited to implementation on portable embedded systems, wearable devices or ultra-low-power chips. We present a new method to detect the QRS signal in a simple way, with minimal computational cost and resource needs, using a novel non-linear filter.

    FPGA Acceleration of Pre-Alignment Filters for Short Read Mapping With HLS

    Pre-alignment filters are useful for reducing the computational requirements of genomic sequence mappers. Most of them are based on estimating or computing the edit distance between sequences and their candidate locations in a reference genome using a subset of the dynamic programming table used to compute Levenshtein distance. Some of their FPGA implementations use classic HDL toolchains, thus limiting their portability. Currently, most FPGA accelerators offered by heterogeneous cloud providers support C/C++ HLS. In this work, we implement and optimize several state-of-the-art pre-alignment filters using C/C++-based HLS to expand their portability to a wide range of systems supporting the OpenCL runtime. Moreover, we perform a complete analysis of the performance and accuracy of the filters and analyze the implications of the results. The maximum throughput obtained by an exact filter is 95.1 MPairs/s including memory transfers using 100 bp sequences, which is the highest ever reported for a comparable system and more than two times faster than previous HDL-based results. The best energy efficiency obtained from the accelerator (not considering host CPU) is 2.1 MPairs/J, more than one order of magnitude higher than other comparable accelerator-based approaches from the state of the art.
    Funding: European Union Regional Development Fund (ERDF) (funder ID 10.13039/501100008530) within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of the total cost eligible under the Designing RISC-V based Accelerators for next generation computers project (DRAC) (grant number 001-P-001723); Catalan Government (funder ID 10.13039/501100002809) (grant numbers 2017-SGR-313 and 2017-SGR-1624); Spanish Ministry of Science, Innovation and Universities (funder ID 10.13039/501100004837) (grant numbers PID2020-113614RB-C21 and RTI2018-095209-B-C22).
    Peer reviewed. Postprint (published version).
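    The general idea of deciding on a candidate location from only a subset of the dynamic programming table can be sketched with a banded Levenshtein check. This is a generic textbook-style illustration, assuming an edit threshold `k`; the function name `within_edit_threshold` is hypothetical, and the code is not one of the filters evaluated in the paper nor an HLS design.

```python
def within_edit_threshold(read: str, ref: str, k: int) -> bool:
    """Return True if the Levenshtein distance between read and ref is <= k.

    Only cells with |i - j| <= k are evaluated (a diagonal band of the
    DP table). Hypothetical helper for illustration, not a filter from
    the paper.
    """
    n, m = len(read), len(ref)
    if abs(n - m) > k:
        return False  # the length difference alone already exceeds k
    inf = k + 1       # every value above k is treated as "reject"
    # Row 0 of the table: j insertions, capped at inf outside the band.
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # For simplicity each row is allocated in full; only the band is filled.
        cur = [inf] * (m + 1)
        if i <= k:
            cur[0] = i  # i deletions
        lo, hi = max(1, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         inf)
        # Early exit: if the whole band already exceeds k, reject the pair.
        if min(cur[max(0, i - k):hi + 1]) > k:
            return False
        prev = cur
    return prev[m] <= k
```

    For example, `within_edit_threshold("ACGT", "AGGT", 1)` accepts the pair, while a threshold of 0 rejects it. A hardware filter would typically store only the band itself; the full-row allocation here is a simplification for readability.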

    Performance analysis techniques for multi-soft-core and many-soft-core systems

    Multi-soft-core systems are a viable and interesting solution for embedded systems that need a particular tradeoff between performance, flexibility and development speed. As growing FPGA capacity allows it, many-soft-core systems are also expected to become relevant to future embedded systems. As a consequence, parallel programming methods and tools will necessarily be embraced as part of the full system development process. Performance analysis is an important part of the development process for parallel applications. It is usually mandatory when a desired level of performance must be reached or when verifying that the system meets its real-time constraints. One of the usual techniques used by the HPC community is the postmortem analysis of application traces. However, this technique is not easily transferred to FPGA-based embedded systems because of the resource limitations of these platforms. We propose several techniques and some hardware architectural support to generate traces on FPGA-based multiprocessor systems and use them to optimize the performance of the running applications.